Document Structure Analysis Based on Layout and Textual Features

Authors

  • Stefan Klink
  • Andreas Dengel
  • Thomas Kieninger
Abstract

Document image processing is a crucial process in office automation. It begins at the OCR phase, with difficulties arising in document analysis and understanding. This paper presents a hybrid and comprehensive approach to document structure analysis. It is hybrid in the sense that it uses both layout (geometric) and textual features of a given document. These features form the basis for potential conditions, which in turn are used to express the fuzzy-matched rules of an underlying rule base. Rules can be formulated on features observed within one specific layout object, but they can also express dependencies between different layout objects. In addition to its rule-driven analysis, which allows easy adaptation to specific domains with their specific logical objects, the system contains domain-independent markup algorithms for common objects (e.g. lists).
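
The rule mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names (LayoutObject, Rule, near_top, large_font, contains_year, follows_larger_text, label_object), the membership ramps, the use of min as the fuzzy AND, and the 0.5 threshold are all assumptions chosen for illustration. Each condition returns a degree of match in [0, 1]; some conditions look only at one layout object, while one looks at other objects on the page to express a dependency between layout objects.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LayoutObject:
    """Hypothetical layout object: a text block with simple geometry."""
    text: str
    y: float          # top edge as a fraction of page height (0.0 = top)
    font_size: float  # in points


# A condition maps (object, all objects on the page) to a degree in [0, 1].
Condition = Callable[[LayoutObject, List[LayoutObject]], float]


def near_top(obj: LayoutObject, page: List[LayoutObject]) -> float:
    # Layout feature: 1.0 at the very top, falling to 0.0 at 30% of the page.
    return max(0.0, 1.0 - obj.y / 0.3)


def large_font(obj: LayoutObject, page: List[LayoutObject]) -> float:
    # Layout feature: linear ramp from 10 pt (0.0) to 18 pt (1.0).
    return min(1.0, max(0.0, (obj.font_size - 10.0) / 8.0))


def contains_year(obj: LayoutObject, page: List[LayoutObject]) -> float:
    # Textual feature: crisp match on a four-digit year.
    return 1.0 if re.search(r"\b(19|20)\d{2}\b", obj.text) else 0.0


def follows_larger_text(obj: LayoutObject, page: List[LayoutObject]) -> float:
    # Dependency between objects: some object above this one uses a larger font.
    above = (o for o in page if o.y < obj.y and o.font_size > obj.font_size)
    return 1.0 if any(above) else 0.0


@dataclass
class Rule:
    label: str
    conditions: List[Condition]

    def score(self, obj: LayoutObject, page: List[LayoutObject]) -> float:
        # Assumed fuzzy AND: the weakest condition limits the rule's score.
        return min(cond(obj, page) for cond in self.conditions)


def label_object(obj: LayoutObject, page: List[LayoutObject],
                 rules: List[Rule], threshold: float = 0.5) -> Optional[str]:
    """Return the label of the best-scoring rule, or None below the threshold."""
    best = max(rules, key=lambda r: r.score(obj, page))
    return best.label if best.score(obj, page) >= threshold else None


RULES = [
    Rule("title", [near_top, large_font]),
    Rule("author", [near_top, follows_larger_text]),
    Rule("date", [contains_year]),
]

PAGE = [
    LayoutObject("Document Structure Analysis", y=0.05, font_size=18.0),
    LayoutObject("Stefan Klink", y=0.12, font_size=12.0),
    LayoutObject("This paper presents a hybrid approach.", y=0.50, font_size=10.0),
    LayoutObject("Published 2000", y=0.90, font_size=10.0),
]

if __name__ == "__main__":
    for block in PAGE:
        print(repr(block.text), "->", label_object(block, PAGE, RULES))
```

Adapting such a sketch to a new domain would mean editing only the RULES list, which mirrors the abstract's point that a rule-driven analysis is easy to adapt to domain-specific logical objects.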

Similar resources

An Analysis of English and Persian Academic Written Discourses in Human Sciences: An Evolutionary Account

The present paper focused on the sociocultural explanations of rhetorical differences between English and Persian and was based on the contrastive genre analysis of Applied Linguistics research article abstracts in these two languages. The evolutionary nature of research article abstracts was also investigated from 1985 to 2005, in three stages, with a time interval of 10 years. Seventy eight r...

Logical structure detection for heterogeneous document classes

We present a fully implemented system based on generic document knowledge for detecting the logical structure of documents for which only general layout information is assumed. In particular, we focus on detecting the reading order. Our system integrates components based on computer vision, artificial intelligence, and natural language processing techniques. The prominent feature of our framewo...

Logical Layout Recovery: approach for graphic-based features

In contrast to the existing approaches for document analysis and understanding this paper represents a system that considers a logical role for graphic content in predominantly textual, born digital PDF documents. This work was inspired by the idea of using structural graphic objects in order to clarify the logical layout even of complex mostly graphic documents. Based on visual cognition, geom...

Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...

Discovering Knowledge through Multi-modal Association Rule Mining for Document Image Analysis

The paper introduces a descriptive data mining method to discover knowledge for the task of automatic categorization in document image analysis. We argue that a document image is a multi-modal unit of analysis whose semantics is deduced from a combination of textual content, layout structure and logical structure. So, the method considers simultaneously different modalities of document represen...

Publication date: 2000